{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Other Helper Methods\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Table Joins\n",
    "\n",
    "ParquetDB provides a custom `join_tables` function that extends beyond the built-in [PyArrow `join`](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.join) method, allowing you to handle **custom extension types** and **more complex data types** in your joins. \n",
    "\n",
    "This notebook demonstrates how to perform various join operations—such as `left semi`, `right semi`, `left anti`, `right anti`, `inner`, `left outer`, `right outer`, `full outer` joins—on two PyArrow tables.\n",
    "\n",
    "\n",
    "While PyArrow’s built-in `join` is powerful, certain use cases may involve:\n",
    "\n",
    "- **Custom extension types** that PyArrow doesn’t support out-of-the-box.\n",
    "- **Complex or nested types** that require additional logic during joins (e.g., arrays of structs, custom objects, etc.).\n",
    "\n",
    "\n",
    "\n",
    "Below is the full implementation of the custom `join_tables` function. It closely mimics the logic of PyArrow’s built-in `Table.join` but adds:\n",
    "\n",
    "1. Index columns (`left_index` and `right_index`) to preserve the original row ordering.\n",
    "2. Logic to **coalesce keys** (if `coalesce_keys=True`).\n",
    "3. Automatic handling of **suffixes** for overlapping columns (`left_suffix` and `right_suffix`).\n",
    "4. The ability to seamlessly merge **custom extension types** and **complex data** that might otherwise be incompatible with the standard PyArrow join.\n",
    "\n",
    "```python\n",
    "def join_tables(\n",
    "    left_table: pa.Table,\n",
    "    right_table: pa.Table,\n",
    "    left_keys,\n",
    "    right_keys=None,\n",
    "    join_type=\"left outer\",\n",
    "    left_suffix=None,\n",
    "    right_suffix=None,\n",
    "    coalesce_keys=True,\n",
    "):\n",
    "    \"\"\"\n",
    "    Custom join operation for PyArrow Tables, accommodating complex or extension types\n",
    "    and additional logic for suffixes and metadata merging.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    left_table : pa.Table\n",
    "        The left-side table to join.\n",
    "    right_table : pa.Table\n",
    "        The right-side table to join.\n",
    "    left_keys : list or str\n",
    "        Column name(s) in the left table for the join.\n",
    "    right_keys : list or str, optional\n",
    "        Column name(s) in the right table for the join.\n",
    "    join_type : str, optional\n",
    "        Type of join to perform. E.g., 'left outer', 'right outer', 'inner', 'full outer',\n",
    "        'left semi', 'right semi', 'left anti', 'right anti'. Defaults to 'left outer'.\n",
    "    left_suffix : str, optional\n",
    "        Suffix for overlapping column names from the left table.\n",
    "    right_suffix : str, optional\n",
    "        Suffix for overlapping column names from the right table.\n",
    "    coalesce_keys : bool, optional\n",
    "        Whether to coalesce join keys if columns have null values. Defaults to True.\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   id_1  id_2 field_1\n",
      "0   100    10  left_1\n",
      "1    33    12    None\n",
      "2    12    13    None\n",
      "   id_1  id_2  field_2\n",
      "0   100    10  right_1\n",
      "1     5     5     None\n",
      "2    33    13     None\n",
      "3    33    12  right_2\n",
      "   field_2  id_1  id_2 field_1\n",
      "0  right_1   100    10  left_1\n",
      "1  right_2    33    12    None\n",
      "2     None     5     5    None\n",
      "3     None    33    13    None\n"
     ]
    }
   ],
   "source": [
    "import pyarrow as pa\n",
    "from parquetdb.utils import pyarrow_utils\n",
    "\n",
    "# Construct two sample tables using ParquetDB-like logic\n",
    "left_data = [\n",
    "    {\"id_1\": 100, \"id_2\": 10, \"field_1\": \"left_1\"},\n",
    "    {\"id_1\": 33, \"id_2\": 12},\n",
    "    {\"id_1\": 12, \"id_2\": 13, \"field_2\": \"left_2\"},\n",
    "]\n",
    "\n",
    "right_data = [\n",
    "    {\"id_1\": 100, \"id_2\": 10, \"field_2\": \"right_1\"},\n",
    "    {\"id_1\": 5, \"id_2\": 5},\n",
    "    {\"id_1\": 33, \"id_2\": 13, \"extra_field\": \"right_extra\"},\n",
    "    {\"id_1\": 33, \"id_2\": 12, \"field_2\": \"right_2\"},\n",
    "]\n",
    "\n",
    "# Convert to PyArrow tables\n",
    "left_table = pa.Table.from_pylist(left_data)\n",
    "right_table = pa.Table.from_pylist(right_data)\n",
    "\n",
    "df_left = left_table.to_pandas()\n",
    "df_right = right_table.to_pandas()\n",
    "\n",
    "print(df_left)\n",
    "print(df_right)\n",
    "\n",
    "# Perform a left outer join using built-in PyArrow\n",
    "pyarrow_join_result = right_table.join(\n",
    "    left_table,\n",
    "    keys=[\"id_1\", \"id_2\"],\n",
    "    right_keys=[\"id_1\", \"id_2\"],\n",
    "    join_type=\"left outer\",\n",
    "    left_suffix=\"_right\",\n",
    "    right_suffix=\"_left\",  # reversed to illustrate differences\n",
    ")\n",
    "\n",
    "# Perform the same join with our custom join_tables\n",
    "custom_join_result = pyarrow_utils.join_tables(\n",
    "    right_table,\n",
    "    left_table,\n",
    "    left_keys=[\"id_1\", \"id_2\"],\n",
    "    right_keys=[\"id_1\", \"id_2\"],\n",
    "    join_type=\"left outer\",\n",
    "    left_suffix=\"_right\",\n",
    "    right_suffix=\"_left\",\n",
    "    coalesce_keys=True,\n",
    ")\n",
    "\n",
    "\n",
    "df_custom_join = custom_join_result.to_pandas()\n",
    "\n",
    "print(df_custom_join)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Drop Duplicates\n",
    "\n",
    "ParquetDB also provides a `drop_duplicates` function that allows you to drop duplicate rows from a PyArrow Table based on specified keys, keeping the first occurrence.\n",
    "\n",
    "\n",
    "```python\n",
    "def drop_duplicates(table, keys):\n",
    "    \"\"\"\n",
    "    Drops duplicate rows from a PyArrow Table based on the specified keys, \n",
    "    keeping the first occurrence.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    table : pyarrow.Table\n",
    "        The input table from which duplicates will be removed.\n",
    "    keys : list of str\n",
    "        A list of column names that determine the uniqueness of rows.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    pyarrow.Table\n",
    "        A new table with duplicates removed, keeping the first occurrence \n",
    "        of each unique key combination.\n",
    "    \"\"\"\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original Table:\n",
      "   id     name  category\n",
      "0   0    Alice         1\n",
      "1   1      Bob         1\n",
      "2   2      Bob         1\n",
      "3   3  Charlie         2\n",
      "4   4    Alice         1\n",
      "\n",
      "Deduplicated Table (keeping first occurrence):\n",
      "   id     name  category\n",
      "0   0    Alice         1\n",
      "1   1      Bob         1\n",
      "2   3  Charlie         2\n"
     ]
    }
   ],
   "source": [
    "data = [\n",
    "    {\"id\": 0, \"name\": \"Alice\", \"category\": 1},\n",
    "    {\"id\": 1, \"name\": \"Bob\", \"category\": 1},\n",
    "    {\n",
    "        \"id\": 2,\n",
    "        \"name\": \"Bob\",\n",
    "        \"category\": 1,\n",
    "    },  # Duplicate combination of (name, category)\n",
    "    {\"id\": 3, \"name\": \"Charlie\", \"category\": 2},\n",
    "    {\n",
    "        \"id\": 4,\n",
    "        \"name\": \"Alice\",\n",
    "        \"category\": 1,\n",
    "    },  # Another duplicate combination of (name, category)\n",
    "]\n",
    "\n",
    "# Convert to a PyArrow table\n",
    "table = pa.Table.from_pylist(data)\n",
    "\n",
    "# Specify the key columns that define uniqueness (excluding \"id\"—the function will add it automatically)\n",
    "unique_keys = [\"name\", \"category\"]\n",
    "\n",
    "# Drop duplicates\n",
    "deduplicated_table = pyarrow_utils.drop_duplicates(table, unique_keys)\n",
    "\n",
    "# Show results\n",
    "print(\"Original Table:\")\n",
    "print(table.to_pandas())\n",
    "\n",
    "print(\"\\nDeduplicated Table (keeping first occurrence):\")\n",
    "print(deduplicated_table.to_pandas())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "parquetdb_dev",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.20"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
